Preparations

Load the necessary libraries

library(mgcv)      #for GAMs
library(broom)     #for tidy results
library(gratia)    #for GAM plots
library(DHARMa)    #for residual diagnostics
library(performance) #for residual diagnostics
library(see)         #for plotting residuals
library(emmeans)   #for marginal means etc
library(MuMIn)     #for model selection and AICc
library(tidyverse) #for data wrangling
theme_set(theme_classic())

Scenario

This is an entirely fabricated example (how embarrassing). So here is a picture of some red grouse chicks to compensate.

Red grouse chicks

Format of data_gam.csv data file

x y
2 3
4 5
8 6
10 7
14 4
x - a continuous predictor
y - a continuous response
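
If the csv file is not to hand, the five rows shown above can be entered directly (this simply reproduces the file's contents):

```r
library(tibble)  # part of the tidyverse loaded above

# Re-create the contents of data_gam.csv row by row
data_gam <- tribble(
  ~x, ~y,
   2,  3,
   4,  5,
   8,  6,
  10,  7,
  14,  4
)
```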

Read in the data

data_gam <- read_csv('../data/data_gam.csv', trim_ws=TRUE)
## Parsed with column specification:
## cols(
##   x = col_double(),
##   y = col_double()
## )
glimpse(data_gam)
## Rows: 5
## Columns: 2
## $ x <dbl> 2, 4, 8, 10, 14
## $ y <dbl> 3, 5, 6, 7, 4

Exploratory data analysis

p <- data_gam %>%
  ggplot(aes(x, y)) +
  geom_point()
p +
  geom_line(aes(col=y)) +
  geom_smooth(se=F, col="red") +
  geom_smooth(method = "lm", se=F, col="green") +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se=F, col="yellow")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : span too small. fewer data values than degrees of freedom.
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 1.94
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 6.06
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 36.724
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : Chernobyl! trL>n 5
## Warning in sqrt(sum.squares/one.delta): NaNs produced
## `geom_smooth()` using formula 'y ~ x'

(g1 <- p + geom_smooth(method = "gam", formula = y~s(x, k=3), se=T, col="black"))

The GAM clearly looks the best. We can inspect its basis functions using the following:

(g2 <- basis(s(x, k=3), data=data_gam) %>% draw)

require(patchwork)
## Loading required package: patchwork
g1 / g2

To see other basis functions and their respective basis function definitions:

basis(s(x, k=3, bs = "cr"), data=data_gam) %>% draw

smoothCon(s(x, k=3, bs = "cr"), data=data_gam)[[1]]$X # the basis functions evaluated at each data point
##             [,1]      [,2]        [,3]
## [1,]  1.00000000 0.0000000  0.00000000
## [2,]  0.59259259 0.4814815 -0.07407407
## [3,]  0.00000000 1.0000000  0.00000000
## [4,] -0.09259259 0.8518519  0.24074074
## [5,]  0.00000000 0.0000000  1.00000000

Model formula: \[ y_i \sim \mathcal{N}(\mu_i, \sigma^2)\\ \mu_i = \beta_0 + f(x_i)\\ f(x_i) = \sum^k_{j=1}{b_j(x_i)\beta_j} \]

where \(\beta_0\) is the y-intercept, and \(f(x)\) indicates an additive smoothing function of \(x\).

Fit the model

gam_mod <- gam(y ~ s(x, k=3), data = data_gam, method = "REML")
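
To connect the fitted object back to the formula above: the linear-predictor matrix holds the basis functions \(b_j(x_i)\) evaluated at the data, and multiplying it by the estimated coefficients reproduces the fitted values exactly. A minimal self-contained check (re-entering the same five points):

```r
library(mgcv)

# The same five observations and model as above
data_gam <- data.frame(x = c(2, 4, 8, 10, 14), y = c(3, 5, 6, 7, 4))
gam_mod  <- gam(y ~ s(x, k = 3), data = data_gam, method = "REML")

# Rows of X are the evaluated basis functions (plus the intercept column);
# beta holds beta_0 and the basis coefficients beta_j
X    <- predict(gam_mod, type = "lpmatrix")
beta <- coef(gam_mod)

# mu_i = beta_0 + sum_j b_j(x_i) beta_j, identical to fitted()
all.equal(as.numeric(X %*% beta), as.numeric(fitted(gam_mod)))
```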

Model validation

par(mfrow = c(2,2))
gam.check(gam_mod, pch=19)

## 
## Method: REML   Optimizer: outer newton
## full convergence after 5 iterations.
## Gradient range [-4.151024e-06,2.176788e-06]
## (score 6.024247 & scale 0.4120262).
## Hessian positive definite, eigenvalue range [0.2676093,1.6827].
## Model rank =  3 / 3 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##        k'  edf k-index p-value
## s(x) 2.00 1.95    2.19    0.98
par(mfrow = c(1,1))

k.check(gam_mod) # gives the summary values without the plots
##      k'      edf  k-index p-value
## s(x)  2 1.949003 2.186796  0.9825

k’ is the maximum degrees of freedom of the smooth (we asked for k = 3, leaving k − 1 after the identifiability constraint is absorbed). edf is the estimated (effective) degrees of freedom the smooth actually used. If the edf is close to k’, the smooth may be hitting the limit of the basis and k may be too low; if the edf is much less than k’, the basis was more complex than needed and the penalty has shrunk it back.

The k-index measures whether pattern remains in the residuals along the smooth’s covariate: it essentially compares the variability of differences between neighbouring residuals to the overall residual variability. A k-index well below 1 suggests leftover pattern, and hence a poor fit.

The p-value indicates whether there is evidence of over-constraining the model. If the model requires a more complex fit than the basis allows, this value will be less than 0.05.

Check the model’s fit:

=> Evidence of overconstraining basis functions:

  • Is the k-index low?
  • Is the associated p-value less than 0.05?
  • Are k’ and edf similar?

=> Evidence of overly-complicated models:

  • Are the data well distributed across the range of the predictor?
  • Is the edf close to 1?
  • Is the p-value non-significant?
  • Does the plotted relationship seem odd?
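
One practical way to act on these checks is to refit with a larger k and see whether the edf climbs towards the new k’; if it barely moves, the original basis was rich enough. A sketch on the same five-point data:

```r
library(mgcv)

data_gam <- data.frame(x = c(2, 4, 8, 10, 14), y = c(3, 5, 6, 7, 4))

gam_k3 <- gam(y ~ s(x, k = 3), data = data_gam, method = "REML")
gam_k4 <- gam(y ~ s(x, k = 4), data = data_gam, method = "REML")

# If the edf stays roughly the same when k is increased,
# k = 3 was not over-constraining the smooth
k.check(gam_k3)
k.check(gam_k4)
```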

Using gratia (for same plots, but using ggplot2)

appraise(gam_mod)

resids <- simulateResiduals(gam_mod, plot = T)

Instead of collinearity, we have another assumption for GAMs, specifically:

How well does a specific basis or smoother term approximate another smoother?

This is called the ‘concurvity’ assumption. We use the concurvity() function (from mgcv) for this assessment.

Concurvity ranges from 0 to 1 and measures how well each smooth term could be approximated by the other terms in the model. Terms that can be closely approximated by one another have high concurvity (close to 1).

concurvity(gam_mod)
##                  para         s(x)
## worst    3.194887e-30 3.487872e-30
## observed 3.194887e-30 3.148783e-30
## estimate 3.194887e-30 2.393887e-31

This produces three outputs:

  • ‘worst’ is the worst-case scenario; you can usually ignore this.
  • ‘observed’ tends to be over-optimistic (it can understate the concurvity), so ignore this one as well.
  • ‘estimate’ is the most useful and likely to be the most accurate.

It is difficult to see exactly how the ‘observed’, and even more so the ‘estimate’, is calculated, but the ‘estimate’ is a good compromise between the worst- and best-case measures of concurvity.
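
To see what problematic concurvity looks like, here is a hypothetical two-smooth example in which one covariate is (noisily) a smooth function of the other; the simulated data and variable names are invented for illustration:

```r
library(mgcv)

set.seed(1)
n  <- 100
x1 <- runif(n)
x2 <- x1^2 + rnorm(n, sd = 0.01)  # x2 is almost a smooth function of x1
y  <- sin(2 * pi * x1) + rnorm(n, sd = 0.2)

m <- gam(y ~ s(x1) + s(x2))

# Expect high values for both smooths: each can be closely
# approximated by the other
concurvity(m)
```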

Next, we plot the partial plots.

Partial plots

gam_mod %>% draw(residuals = T)

Note below this plot is a rug plot.

Model investigation / hypothesis testing

GAMs are descriptive rather than causal models.

They can also be used to create non-linear offsets or covariates to soak up noisy variation in regular models.

Thus, GAMs are more about taking away the variation to better explain natural processes, rather than being actual causal mechanisms or for answering important hypotheses themselves.
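
As a sketch of that ‘soak up the noise’ usage (hypothetical data and variable names), a smooth of a nuisance variable can absorb a background trend so the parametric effect of interest is estimated cleanly:

```r
library(mgcv)

set.seed(42)
n     <- 200
day   <- runif(n, 0, 365)
treat <- factor(rep(c("control", "treated"), length.out = n))
# True model: a seasonal trend plus a treatment effect of 0.5
y <- 2 + sin(2 * pi * day / 365) + (treat == "treated") * 0.5 +
  rnorm(n, sd = 0.3)

# s(day) soaks up the seasonal trend; the parametric table then
# gives the treatment contrast adjusted for that trend
m <- gam(y ~ treat + s(day), method = "REML")
summary(m)$p.table
```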

summary(gam_mod)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## y ~ s(x, k = 3)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   5.0000     0.2871   17.42  0.00293 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##        edf Ref.df     F p-value  
## s(x) 1.949  1.997 10.88  0.0861 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.835   Deviance explained = 91.5%
## -REML = 6.0242  Scale est. = 0.41203   n = 5

In GAMs, the partial plots always have the y-axis centred on zero; we will need to change this later. The reason is that with multiple predictors it becomes increasingly important to standardize each partial effect across the other covariates, so to recover the response scale on the y-axis we will have to specify which value(s) of the other variable(s) to condition on.

Summary outputs:

  • edf: estimated (effective) degrees of freedom. If close to 1, the smooth is essentially a straight line.
  • Ref.df: the reference degrees of freedom used in computing the test statistic; it can generally be ignored.
  • F-test: an approximate test of the null hypothesis that the smooth term is zero, i.e. that x has no effect at all.
  • R-sq: an adjusted pseudo-R², though deviance explained is usually the better value to report.
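
broom (loaded at the top) can pull these same quantities into tidy data frames, which is convenient when tabulating several models:

```r
library(mgcv)
library(broom)

data_gam <- data.frame(x = c(2, 4, 8, 10, 14), y = c(3, 5, 6, 7, 4))
gam_mod  <- gam(y ~ s(x, k = 3), data = data_gam, method = "REML")

tidy(gam_mod)                     # smooth terms: edf, ref.df, statistic, p.value
tidy(gam_mod, parametric = TRUE)  # parametric terms: just the intercept here
glance(gam_mod)                   # model-level: df, logLik, AIC, BIC, deviance
```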

Further analyses

d <- derivatives(gam_mod, order=1) # get first-order derivatives across the plot
d
draw(d)

To get the value of x where the first derivative is closest to zero (a turning point of the fitted curve, here the peak), along with the same thing for the derivatives of the lower and upper 95% curves:

(d_ci <- d %>%
  summarise(val = data[which.min(abs(derivative))],
            lwr = data[which.min(abs(lower))],
            upr = data[which.min(abs(upper))]))
g1 + 
  geom_vline(data=d_ci, aes(xintercept = val), col="red") +
  geom_vline(data=d_ci, aes(xintercept = lwr), col="red", linetype = "dashed") +
  geom_vline(data=d_ci, aes(xintercept = upr), col="red", linetype = "dashed")

The peak is around x = 8.5, with a 95% interval of roughly 7.9 to 9.5.

We could also draw this as two layers: a geom_line() underneath spanning the whole range, and a geom_segment() on top that ends and then restarts at the 95% CI limits.
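
One way to realise that idea is to compute the derivative and its 95% CI by hand (finite differences on the linear-predictor matrix, a standard recipe for GAM derivatives) and colour the fitted line by whether the CI excludes zero. A self-contained sketch:

```r
library(mgcv)
library(ggplot2)

data_gam <- data.frame(x = c(2, 4, 8, 10, 14), y = c(3, 5, 6, 7, 4))
gam_mod  <- gam(y ~ s(x, k = 3), data = data_gam, method = "REML")

# First derivative and its 95% CI via finite differences on the
# linear-predictor matrix
eps  <- 1e-5
grid <- data.frame(x = seq(2, 14, length.out = 200))
X0   <- predict(gam_mod, grid, type = "lpmatrix")
X1   <- predict(gam_mod, within(grid, x <- x + eps), type = "lpmatrix")
Xd   <- (X1 - X0) / eps
grid$deriv <- as.numeric(Xd %*% coef(gam_mod))
grid$se    <- sqrt(rowSums((Xd %*% vcov(gam_mod)) * Xd))
grid$lwr   <- grid$deriv - 1.96 * grid$se
grid$upr   <- grid$deriv + 1.96 * grid$se

# Fitted curve, coloured by whether the slope's CI excludes zero
grid$fit <- as.numeric(predict(gam_mod, grid))
grid$sig <- grid$lwr > 0 | grid$upr < 0
ggplot(grid, aes(x, fit, group = 1)) +
  geom_line(aes(colour = sig), linewidth = 1)
```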

Summary figures

gam_mod_list <- with(data_gam, list(x = modelr::seq_range(x, n=100)))
newdata <- emmeans(gam_mod, ~x, at = gam_mod_list) %>%
  as.data.frame %>%
  mutate(y = emmean, lwr = lower.CL, upr = upper.CL)
ggplot(newdata, aes(y = y, x = x)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr), fill = "blue", alpha = 0.3) +
  geom_line() +
  geom_point(data = data_gam) +
  geom_vline(data=d_ci, aes(xintercept = val), col="red") +
  geom_vline(data=d_ci, aes(xintercept = lwr), col="red", linetype = "dashed") +
  geom_vline(data=d_ci, aes(xintercept = upr), col="red", linetype = "dashed")
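
The same figure can be built without emmeans by calling predict() directly; note this sketch uses a normal (1.96) critical value, so the band will differ very slightly from the t-based emmeans one:

```r
library(mgcv)
library(ggplot2)

data_gam <- data.frame(x = c(2, 4, 8, 10, 14), y = c(3, 5, 6, 7, 4))
gam_mod  <- gam(y ~ s(x, k = 3), data = data_gam, method = "REML")

newdata <- data.frame(x = seq(min(data_gam$x), max(data_gam$x),
                              length.out = 100))
pred <- predict(gam_mod, newdata, se.fit = TRUE)
newdata$y   <- as.numeric(pred$fit)
newdata$lwr <- newdata$y - 1.96 * pred$se.fit
newdata$upr <- newdata$y + 1.96 * pred$se.fit

ggplot(newdata, aes(x, y)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr), fill = "blue", alpha = 0.3) +
  geom_line() +
  geom_point(data = data_gam)
```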
